Circular stable intronic RNAs possess distinct biological features and are deregulated in bladder cancer

Abstract Until recently, intronic lariats were regarded as short-lasting splicing byproducts with no apparent function; however, increasing evidence of stable derivatives suggests regulatory roles. Yet little is known about their characteristics, functions, distribution, and expression in healthy and tumor tissue. Here, we profiled and characterized circular stable intronic sequence RNAs (sisRNAs) using total RNA-Seq data from bladder cancer (BC; n = 457, UROMOL cohort), healthy tissue (n = 46), and fractionated cell lines (n = 5). We found that the recently-discovered full-length intronic circles and the stable lariats formed distinct subclasses, with a surprisingly high intronic circle fraction in BC (∼45%) compared to healthy tissues (0–20%). The stable lariats and their host introns were characterized by small transcript sizes, highly conserved BP regions, enriched BP motifs, and localization in multiple cell fractions. Additionally, circular sisRNAs showed tissue-specific expression patterns. We found nine circular sisRNAs as differentially expressed across early-stage BC patients with different prognoses, and sisHNRNPK expression correlated with progression-free survival. In conclusion, we identify distinguishing biological features of circular sisRNAs and point to specific candidates (incl. sisHNRNPK, sisWDR13 and sisMBNL1) that were highly expressed, had evolutionary conserved sequences, or had clinical correlations, which may facilitate future studies and further insights into their functional roles.

(3' SS), generate the intronic lariat byproduct containing the unique 2'-5' phosphodiester linkage ( 3 ). The closed lariats are generated in an equal amount to the spliced exon junctions (mature mRNA) but are usuall y ra pidl y turned over. While most spliced mRNAs are transported to the cytoplasm, intronic lariats remain in the nucleus. Here, the ratelimiting step of laria t degrada tion is the Dbr1 induced hydrolysis of the 2'-5' linkage. This leaves a linear debranched intronic RN A molecule, w hich is usuall y quickl y degraded by exonucleases ( 4 ).
Stable lariats were first reported in T-cells ( 5 ) and viruses (6)(7)(8)(9). La ter, identifica tion was made of intronic RNAs tha t accumula ted in the oocyte nucleus of the frog Xenopus tropicalis and showed an unusual stability for up to two days, termed stable intronic sequence RNAs (sisRNAs) ( 10 ). While the nuclear sisRNAs exist as both lariats and linear molecules, sisRNAs in the cytoplasm of Xenopus tropicalis oocyte are only found in the form of circular sisRNAs ( 11 ); a class later described as comprising both stable lariats without tails, which escape debranching, and full-length intronic circles with a 3'-5' linkage ( 12 ).
A complete circularization of the entire intron (3'-5' circles) has also been observed in transcriptome-wide branchpoint-mapping studies in human cells ( 13 , 14 ). These studies found that 3% of all inferred branch points mapped to the last intron position (3' SS), creating full-length intronic circles. Intronic cir cles ar e suggested to originate from conventional lariats through a third nucleophilic attack on the 2'-5' linkage or as a subsequent debranching-and-ligation reaction ( 13 , 14 ). While se v eral intronic circles have been found ( 12 , 15 ) and validated ( 14 , 16 , 17 ), their stability, localization, and function have yet to be determined.
Se v eral studies, across multiple species, have found that the majority of stable lariats (possessing the 2'-5' branching junction) contain a cytosine (C) at their branch point instead of the canonical adenine (A) ( 12 , 18-20 ). Cytosine at the branch point may explain the resistance of stable lariats to the Dbr1 enzyme, which is inef ficient a t hydrolyzing the 2'-5' linkage of C-branched lariats ( 21 , 22 ). Circular sis-RNAs can also escape debranching through physical separation from Dbr1, which is achie v ed by e xport to the cytoplasm. Cytoplasmic circular sisRNAs have been identified in cells fr om zebrafish, fr og, chicken, mouse, and human ( 12 ). The cytoplasmic stable lariats are generally deri v ed from short introns ( < 500 bp), do not colocalize with their parent mRNA, and are exported to the cytoplasm via the NXF1 / NXT1 machinery ( 12 ).
Though the majority of both linear and circular sisR-N As remain functionall y uncharacterized, some are shown to have a specific role assigned to them ( 17 , 19 , 23-25 ). One example is circular molecules pr ocessed fr om lariats that escape debranching in human cell lines, termed circular intronic long non-coding RN As (ciRN As) ( 19 ). In particular, ci-ankrd52 associates with the polymerase II transcription machinery and positi v ely regulates parent gene expression ( 19 ). Howe v er, the functional roles of circular sisRNAs in humans during health and disease are still largely unknown.
The role of non-coding RN A (ncRN A) in cell homeostasis is increasingly appreciated with a rising number of publications reporting ncRNA deregulation during a wide range of diseases (26)(27)(28)(29). Cancer is a heterogeneous dis-ease characterized by genomic instability and altered transcription patterns, with bladder cancer (BC) reported as the ninth most commonly occurring malignancy world-wide ( 30 ). Earl y stage BC (non-m uscle invasi v e b ladder cancer; NMIBC) is characterized by a low mortality rate but high r ecurr ence rate leading to long-term patient-follow-up in the form of frequent and e xpensi v e cystoscopic monitoring, to avoid disease progression to a lethal muscle invasi v e stage ( 31 , 32 ). Discov ering biomar kers that can help stratify tumors based on risk of progression is vital for patient-specific treatment and follow-up ( 33 ).
Despite examples of individual sisRNAs, large-scale transcriptome-wide studies remain few as sisRNAs are absent from common data sets with polyA-selected mRNAs. Besides, exact identification of circular sisRNAs relies on precise alignments of the sequencing reads traversing the 5' SS / BP splice junction (Figure 1 A). While the re v erse transcriptase (RT) is capable of reading through the nonconventional 2'-5' bond, such e v ents occur with low efficiency; and the RT is more likely to pause, fall off , or under go template switching at this position resulting in mismatch errors, microinsertions, or deletions in the DNA product ( 34 ).
Howe v er, by analyzing sequenced total RNA from large sample collections, we were able to identify reads traversing the intronic lariat junction. Here, we identify and characterize circular sisRNAs in NMIBC. We show that they possess distinct biologic features potentially promoting stability, are present in multiple cellular compartments, have a tissuespecific expression pattern, comprise a highly conserved stable lariat subclass, and have an overall differential expression between prognostic risk classes in early-stage bladder cancer. This constitutes the first comprehensi v e analysis of circular sisRNAs in cancer.

Identification and expression profiling of circular sisRNAs
The primary data set is a cohort of 457 non-muscle invasi v e b ladder cancer (NMIBC) samples, which has been whole-transcriptome sequenced and clinically annotated in a previous study ( 35 ). We used STAR ( 36 ) to map the total RNA-Seq reads to the human r efer ence genome hg38 ( http://hgdo wnload.soe.ucsc.edu/do wnloads. html ). Gi v en the chimeric reads as input, we ran CIRC-explorer2 with default settings to detect circular sisRNAs and the number of cir cular-junction-supporting r eads ( 37 ). The output from CIRCexplorer2 was compared with other cir cRNA detection and corr ection methods, namely, DCC ( 38 ), circRNA Finder ( 39 ), and CIRIquant ( 40 ) (run on the full CIRCexplorer2 output) to investigate the level of overlap in their output. All pipelines were run with default settings. As only some methods provide elaborate annotation of whether a circRNA junction origin is exonic or intronic, we annotated all detected circular junctions using a none xon-ov erlapping intron set.

Extracting unique intron annotations
We used the R packages "GenomicFeatures" and "Genom-icRanges" ( 44 ) to extract genomic positions for all introns based on the GENCODE (v. 33) gene annotations ( 45 ). Isoforms were disregarded by collapsing overlapping intron annotations. To av oid ex onic influence on read-depth and conserv ation, we ex cluded intronic positions overlapping exons from an alternative isoform. The introns have a large length variability, ranging from 1 bp to 721 292 bp (median = 1461 bp, n = 183 203, Supplementary Figure S1). To reduce the presence of artifacts and non-canonical introns, we only considered introns with a length between 100 and 10 000 bp ( n = 153 582). Finally, we annotated the predicted circular sisRNAs with our resulting intron annotations based on genomic range overlap and close proximity of their 5' coordinates (within 10 bp).
When relevant, we used BEDtools ( 46 ) to extract stranded sequences from the r efer ence genome and SAMtools ( 47 ) to filter reads.

Generating intron and gene expression matrices
We generated intron and gene expression matrices for all data sets mentioned above using HTseq count ( 48 ) in stranded mode and with loose read and intron overlap restrictions. Thus, pr e-mRNA r eads partially overlapping an intron were also counted. The relati v e intron e xpression is defined as (intron FPKM value) / (parent mRNA FPKM value), with a pseudocount of 1e-06 added to both numera tor and denomina tor. For the ENCODE tissue data, we defined the tissue-specific expression as the average expression for all samples within each tissue.

RNase R enrichment of circular sisRNAs
The total RNA-Seq immortalized human fibroblast (HS68) cell data, produced by Jeck et al. , consists of two replicates of RNase R treated and control samples. RNase R enzymatically degrades linear RNAs causing an enrichment in circular RNA species in the treated samples compared to the controls. We used the CIRCexplorer2 generated output to determine the number of circular-junction reads, with a threshold of at least two junction-spanning reads imposed. We counted the number of linear-splice reads across each intron using HTseq count after filtering reads with SAMtools. For each intron, we normalized the number of circular and linear junction reads as reads per million (RPM). Only circular junctions present in both the treated and control sample were included, and only linear junctions having a circular counterpart were included. We calculated the le v el of enrichment as the expression ratio of junctions in the RNAse R treated sample against its corresponding control sample. Enrichment was then defined as a ratio larger than one.

Validation of select circular sisRNAs
We chose a small set of circular sisRNAs for validation based on their expression level in the NMIBC samples and on their detection in total RNA-Seq data from the cell line used ( 35 ).
We re v erse tr anscribed 1 ug RNA from J82 and fr actionated HepG2 cells in a 20 ul reaction with or without 1 ul MLV-RT enzyme (Thermo Fisher) but otherwise adhering to manufacturer's protocol. Then, we PCR amplified 2% with taq DN A pol ymerase (Thermo Fisher) for 35 cycles using primers listed in Supplementary Table S1. PCR fragments were visualized on a 1% agarose gel with 1X SYBR Safe (Invitrogen).

Statistical analyses
We performed all statistical analyses in R ( 49 , 50 ) and generated plots using "ggplot2" ( 51 ). As a measure of central tendency we pr eferr ed medians for robustness to outliers and distributional asymmetry. In addition, we quantified variation using the standard deviation, and when relevant the interquartile range (IQR), which is explicitly stated when used. We used the two-sided non-parametric Wilco x on rank-sum test to evaluate differential distributions. For comparing expression levels in BC risk classes, we r equir ed cir cular sisRNAs to have mor e than ten junction reads across ten samples. We also performed Fisher's exact test to evaluate the association between junction type (circular or linear) and RNase R enrichment for our discrete count data and used odds-ratio to quantify the association strength. Spearman's rank correlation coefficients were used to assess the relationships between pairs of stochastic variables. We conducted Kaplan-Meier survival analysis using a log-rank test with plots from the R packages "Survival" ( 52 ) and "Survminer". In instances of multiple testing, Benjamini-Hochberg (BH) was applied to control the false discovery rate (FDR) with q < 0.1 considered significant. In addition, we report the fold change (FC) as a measure of the central difference between two distributions of interest and define it as the ratio between their means (a1 and a2), with a pseudocount added:

Branch point annotations and related features
To annotate BP positions within all intron sets, we analyzed a comprehensi v e set of e xperimentall y ma pped and statistically inferred BP positions (here named Taggart BPs or simpl y BPs), w hich spans 16.8% of all human introns ( 14 ). We lifted their genomic coordinates to hg38 using UCSC's w e b tool "Lift Genome Annotations" ( 53 ) and assigned them to our intron annotations (Supplementary Methods). We defined the initial 3' tail length as the distance between the branch point and 3' splice site. All intronic lariats with negati v e 3' tail lengths or lengths ≤1 bp, which may result from the truncation of non-exon overlapping intron annotations, were disregarded.

Sequence context
We extracted sequences located from the 5'SS to 20 bp downstream of the 5'SS; 20 bp upstream of the 3' SS to the 3'SS; and spanning 10bp on either side of the BP for all introns. For the BP sequences, all positions within 10 bp of the 3' SS position were excluded from the sequence analyses to avoid effects on the conservation le v els, etc. As a consequence, all introns with a BP to 3' SS distance < 10 bp were disregarded. We computed nucleotide frequencies using the "Biostrings" R package ( 54 ).
Bits were used as a measure of the information content at each position in the sequences, calculated using Kulback-Leib ler di v ergence, and visualized as logo plots using the "ggseqlogo" R package ( 55 ) (Supplementary Methods).

De novo motif discovery and analysis
We used STREME called by the XSTREME tool (with default settings), for comprehensi v e motif discov ery ( 56 , 57 ). STREME (Sensiti v e, thorough, rapid enriched motif elicitation) is geared for large sequence sets ( n > 50) and performs ungapped motif discovery ( 58 , 59 ). Motif significance is gi v en b y E -v alues, which measure the expected number of motifs with as high an enrichment in a random sequence set of the same size ( n = 458).
As positi v e input, we used 101 bp long sequences centered at the branch points of stable lariats (based on CIRC-explorer2 positions). To remove potential splice site motifs, we trimmed ten bases at the splice site ends of the stable lariat host introns. As negati v e control input, we used similarly defined sequences centered at Taggart BPs from non-host intr ons fr om expressed genes ( n = 30 836).
We used FIMO (Find individual motif occurrence) ( 60 ) for locating exact motif occurrences in the stable lariat BP sequence set and applied T omT om ( 61 ) to compare and align results from STREME with kno wn tar get motifs of RNA binding proteins (RBPs) and micro-RNAs ( 62 , 63 ). To evaluate motifs depleted in branch point loci of stable lariats, we ran XSTREME with the background Taggart BP loci as the positi v e set and the BP loci of stable lariats as the negati v e set. Finally, a simple permutation scheme of the pooled sequences was used to generate ten randomly shuffled positi v e and negati v e sets used as input for XSTREME. All motifs were visualized by XSTREME using standard logo plots with bit sizes based on the Shannon entropy.

Intron read-depth
We found the position-wise read-depth for the introns using mapped reads from all the NMIBC samples ( n = 457). First, we only considered reads ma pping directl y to intronic loci and removed all linearly-spliced reads mapping across introns using their CIGAR strings. Subsequent, we selected reads mapping at least partially to the non-overlapping intron annotations defined above. Second, we summarized the read-depth for each sample using DeepTools's bamCoverage ( 64 ) and then summed the read-depth across all samples using UCSC's application bigWigMerge ( 65 ) and their tool bedGraphToBigWig was used for file conversion as well as the tool "bwtools extract" ( 66 ). We averaged the positionwise read-depths per sample (Supplementary Methods).
A similar approach was used to obtain the read-depth of all non-filtered reads from the NMIBC samples.

Intron sequence conservation
We evaluated the intronic sequence conservation using positional phyloP scores ( 67 ) from the 100-way vertebrate alignments downloaded from the UCSC Genome Browser ( http://hgdownload.soe.ucsc.edu ). The intronic positionwise phyloP scores were extracted using "bwtool extract". We defined the whole intron conservation as the mean position-wise conservation across the entire intron region, with 10 bp at splice site ends removed (Supplementary Methods).

Identification and quantification of circular sisRNAs
To quantify the expression of circular sisRNAs in bladder cancer, we used the CIRCexplorer2 pipeline ( 37 ) to identify reads that traverse the splicing junction (Figure 1 A) in total RNA-Seq data from a large cohort of patients ( n = 457; UROMOL) with NMIBC ( 35 ). We initially identified and characterized 11 420 circular junctions supported by at least one junction-spanning read, which we later used to define a restricted set of circular sisRNAs. For data set ov ervie ws and data processing see Supplementary Figure S2. Overall, we observed a small fraction of junction-spanning reads, ranging from barely detectable to 7.7 reads per million (RPM) (median = 1.4 RPM), with a large variation between samples (Figure 1 B). While most predicted junctions are supported by few junction-spanning reads (96% by less than fiv e r eads in total; Supplementary Figur e S3A), some are highly supported, e.g., junctions deri v ed fr om intr ons of HNRNPK (155 reads across 91 samples) and WDR13 (135 reads across 93 samples) (Supplementary Figure S3B).
The RT often mis-incorporates nucleotides while traversing the 2'-5' linkage, leaving uncertainty about the specific branch point location ( 11 , 14 , 68 , 69 ). Consistent with these studies, we found that many predicted junctions (31%) are annotated with the same 5' splice site but differ in their branch point annotations (Figure 1 C), usually only by a few bases (Supplementary Figure S3C). In the most extreme case, 49 predicted junctions had a shared 5' splice site but mapped to distinct branch points. On the other hand, few junctions (0.4%) shared the same branch point while differing in their 5' splice sites (Supplementary Figure S3D).
Predicted junctions from the same intron with the same 5' splice site but differing BP coordinates likely r epr esent the same potential circular RNA due to RT errors. They wer e ther efor e collapsed and r epr esented by the junction with the highest number of supporting reads (Figure 1 D,  Supplementary Figure S2). Based on these collapsed junctions, w e here by identified 9216 junction clusters. We use the term cluster as some branch points differ widely in position and potentially r epr esent alternati v e BP usage (Supplementary Figure S3C). One example is sisHNRNPK with BP positions that differ with up to 254 bp ( Supplementary Figure S4A). Howe v er, as these potential sisRNAs are deri v ed from overlapping loci, they have a high sequence similarity and are likely deri v ed through a similar splicing e v ent.
We compared the junction clusters detected by CIRCex-plorer2 with ones detected by circRNA Finder, DCC, and CIRIquant. We found that approx. 90% of potential circular sisRN A clusters overla p with at least one other detection method (Supplementary Figure S3E). Prominent circular sisRNAs detected by multiple methods include sisH-NRNPK, sisWDR13, sisCNOT6, sisSLK and sisMBNL1 (Supplementary Table S2). In contrast to the other methods, CIRCexplorer2 is the only method, which accounts for intronic lariat biogenesis, and we found it had the highest agreement with intronic splicing and known splice sites. CIRCexplorer2 is designed to handle the intronic junctionreads through a re-alignment step ( 37 ).
Finally, the circular sisRNA (cluster) is defined as being deri v ed fr om pr otein-coding genes, between 30 and 10 000 bp long, and expressed in at least two samples ( n = 1827) (Supplementary Figure S2, Supplementary Tables S2-S4). Of these, the majority were categorized as stable lariats (55%) and the rest as intronic circles (45%) that span the entire intron (Figure 1 A, E). In concordance with stable lariats containing a 2'-5' linkage and intronic circles proposed as having a 3'-5' circularizing junction, stable lariats have more uncertainty at their 3' end (mean of 2.38 predicted junctions) than intronic circles (mean of 1.68 predicted junctions) (Figure 1 Table S2). In contrast, the highest expressed intronic circle, originating from WDR13 (sisWDR13; chrX:48 598 958-48 599 352, Supplementary Figure S4B), has two predicted junctions (differing by one nucleotide, providing less ambiguity about the 3' end), and is supported by the second highest number of junction-spanning reads ( n = 136) across almost a hundred samples ( n = 94) (Figure 1 F, G, Supplementary Table S2).

Validation of cir cular -junctions and select circular sisRNAs
To increase confidence in the validity of circular junctions detected by CIRCexplorer2, we investigated the enrichment of circular junction-spanning reads upon RNase R treatment, w hich enzymaticall y degrades linear RN As ( 42 ). We only included circular junctions detected in both treated and untreated samples. Encouragingly, the majority (83.7-84.1%) of intronic circular junctions in human fibroblast (HS68) cell line data ( 42 ) were enriched upon RNase R tr eatment, which is mor e than the corr esponding linear splice-junctions (25.3-29.5%) (Figure 1 H). The association between enrichment and junction type (circular or linear) were statistically significant ( P -values < 2.2e-16; odds-ratio of 15.4 (replicate 1) and 12.2 (replicate 2); Fisher's exact test).
To confirm the authenticity of individual circular sisR-NAs, we attempted to experimentally validate se v eral of our findings (sisHNRNPK, sisWDR13, sisCNOT6 and sis-ARHGEF10L) in the human bladder cancer cell line J28 and human li v er cancer cell line HepG2 using RT-PCR and Sanger sequencing (Figure 1 I, Supplementary Figure  S5). Junction-spanning primers are shown in Supplementary Table S1. Both sisHNRNPK and sisCNOT were successfully detected by both methods, while the intronic circle sisARHGEF10L was detected using RT-PCR (Figure 1 I).
Although we could not detect sisWDR13 in our cell lines, this molecule has previously been validated in other data ( 14 ) (Figure 1 I). In addition, investigating total RNA-Seq J28 data showed little support for sisWDR13 expression in the cell line, which may explain the unsuccessful validation.

Circular sisRNAs and their host introns have distinct features
Se v eral features hav e been suggested to promote stability of circular sisRNAs, including length, 3' tail size, branch point nucleotide (nt), sequence context and enriched branch point motifs ( 11 , 12 , 18-21 , 24 ). We investigated the presence of these and other features across our set of circular sisR-NAs and their host introns (Figures 2 and 3 , Supplementary  Figures S6-S9). To avoid bias, we based the analyses on a set of non-e xon-ov erlapping introns found to be expressed in our data and disregarded alternative isoforms ( n = 153 582). We defined a set of high-confidence circular sisRNAs with 5' splice sites concordant with these ( n = 1558; Supplementary Figure S2).
We found that the length of intronic circles (median = 854 bp) and stable lariats (median = 982 bp) in bladder cancer are larger than previously reported for the cytoplasm of healthy tissues in vertebrates (majority between 100 and 500 bp) ( 12 ), in porcine tissue ( > 95% of intronic circRNAs had a length < 600 bp) ( 15 ), and in cell line data ( 19 ) (Figure 2 A). For both sisRNA classes, we compared the lengths of circular sisRNA host introns to the lengths of conventional unstable introns from the same genes (r eferr ed to as same-gene non-host introns) as well as to the larger class of intr ons fr om non-parent genes. The same-gene non-host introns were used as a control with the same expression properties as circular sisRNA host introns, while the full set of intr ons fr om non-parent genes is a heterogeneous set with high expression variability. In all instances, the circular sis-RNA host introns were shorter than non-host introns ( qvalues < 6.4e-15, log fold change (FC) < -0.61; Wilco x on rank-sum test; due to multiple testing, we obtained q -values by Benjamini-Hochberg control of the false discovery rate (BH FDR controlled); Figure 2 B, Supplementary Figure S1 and S2). Short 3' SS to BP distances promote stability for sis-RNA in yeast ( 24 ) while stable lariats have been shown to be associated with lack of a 3' tail in other organisms ( 11 , 19 , 42 , 70 ). In our analysis, howe v er, the inferred 3' tail length (BP to 3' SS distance; median = 33 bp) of stable lariats is similar to conventional lariats (median = 27 bp), with intr onic-circle host-intr ons having the shortest tail lengths (median = 24 bp; Supplementary Figure S6A nucleotide frequencies also showed some change between stable lariats (48% canonical adenine ( 13 , 14 , 34 , 69 )) compared to same-gene non-host introns (66% adenine) (Supplementary Figure S6B). We used the previously annotated and comprehensi v e BP map generated by Taggart et al. ( 13 , 14 ) to assign BP positions for all our intron subsets, w hich is subsequentl y used for all branch point anal yses (Supplementary Figure S2A). Comparing the Taggart BPs with the CIRCexplorer2 BPs (only possible for stable lariats), we found a median distance of 0 bp (mean = 7.7 bp) and ∼52% of BPs had a distance < 20 bp (Supplementary Figure S6C). While the exact BP position is hard to recover due to low read-coverage and the nature of the 2'-5' junction, the adjacent BP region showed similar features using either CIRCexplorer2 or Taggart BP annotations (Supplementary Figure S6D).
Turning to the larger sequence context of BPs, we discovered that the degenerate BP consensus motif is diminished in the host introns of stable lariats but strongly present for intr onic circle host-intr ons and non-host intr on classes (Figure 3 A, Supplementary Figure S7 and S8A). Compared to other introns, the host introns of stable lariats have a high uracil (U) frequency adjacent to the branch point position (Figure 3 A, Supplementary Figure S7 and S8A). Lastly, the highly-conserved sequence motifs of both canonical splice sites are observed in all intron classes (Figure 3 A, B, Supplementary Figure S7 and S8A, B).
We further performed de novo motif discovery of enriched sequence motifs in the vicinity of stable lariat BPs using STREME ( 56 , 59 ), which re v ealed two significant motifs: 'AYA UUA UUAA U' and 'UUUAAAA' ( E -value < 0.05, Figure 4 , Supplementary Figur e S9A, B). Ten r eruns on permutated data did not yield any motifs of similar significance. The AU-rich and ele v en nucleotide long 'AYAU-U AUU AAU' motif is found in 43 (9.4%) introns hosting stable lariats and is 6.2x mor e pr evalent than in non-host introns. Inter estingly, the motif r esembles a r epetition of the degenerati v e core part of the human BP consensus motif, 5'-UNA-3', and may promote alternati v e BP usage ( 71 ).
The 'AYA UUA UUAA U' motif is most commonly found upstream of the BP position (median position = -24 bp from the BP) and once per sequence (96%) (Figure 4 ). The two enriched motifs do not match known RBP binding sites or miRNA target sites. Finally, we also considered significant motif underr epr esentation, which led to the finding of two depleted GC-rich motifs (Supplementary Figure  S9C, D). For all four motifs, no differential expression is observ ed between stab le lariats with and without motif occurr ence (Supplementary Figur e S9E). This could be caused by the high sparsity of the junction-read data or as a consequence of the motifs not relating to stability but potentially function.
Lastly, we investigated if the entire host introns of circular sisRNAs were highly expressed compared to nonhost introns (Figure 3 F). In general, the expression of circular sisRNA junctions correlated positi v ely with wholeintron expression across samples, with the circular junctions supported by many reads showing the strongest correlation (Supplementary Figure S10). Additionally, the average read-depth for all intron classes is e v enly distributed across the entire intron length, with a slight increase toward the ends (expected to be caused by pr e-mRNA r eads; Figure 3 C, Supplementary Figure S8C). For both circular sis-RNA classes, the relati v e host-intron e xpression is significantly higher than for the corresponding same-gene nonhost-introns ( q -values < 1.2e-32; log FC > 0.82; Wilco x on rank-sum test; BH FDR controlled, Figure 3 F). Thus, a cir cular sisRNA expr ession signal can be observed at the whole-intron le v el.
In total, two circular sisRNA host introns have a relati v e expression larger than one. They host sisPABPN1 (chr14:23 323 484-23 323 964, ratio = 3.82) and sisRCC1 (chr1:28 508 161-2 8508 618, ratio = 1.02). sisPABPN1 is an intronic circle deri v ed from an intron occasionally retained in the mature parent mRNA. sisRCC1 is a stable lariat deri v ed from an intron also giving rise to SNORA73B, which can partially explain the large intron expression signal.  Figure 2 . ( G ) The mean intron conservation (phyloP100Way score) distributions with the three quartiles enlarged. Same intron classes as in Figure 2 .

Introns hosting stable lariats have conserved sequences across vertebrate species
High cross-species sequence conservation for circular sis-RNA host introns may indicate functional importance.
In addition, position-wise conservation can help pinpoint functionally important loci. Not surprisingly, introns tend to have a low mean position-wise conservation ( ∼40% of all mean intron phyloP-scores are within 0.1 of zero, Figure  3 G). Interestingly, introns hosting stable lariats were found to have a significantly higher intron conservation compared to same-gene non-host introns ( q -value = 0.021, log FC = 2.58; Wilco x on rank-sum test, BH FDR controlled). Most notably, the gene MBNL1 hosts two highly abundant and highly conserved circular sisRNAs (chr3:152 455 578-152 456 060 and chr3:152 445 540-152 446 580; mean whole-intron conservation scores of more than two). The MBNL1 gene encodes an RN A-binding protein, w hich regula tes alterna ti v e mRNA splicing, and is known to be deregulated during myotonic dystrophy ( 72 , 73 ). Two other highly conserved cases are derived from two neighboring introns in HNRNPK (chr9:83 974 619-83 975 461 and sisHNRNPK; Supplementary Figure S4A). The high sequence conservation is a product of a se v eral hundred bp long and highly-conserved region overlapping the two introns.
Further, we examined the conservation le v els of intronic loci important for proper splicing (Figure 3 D, E, Supplementary Figure S8D). As expected for introns in general, the consensus motifs of both splice sites are highly conserved and the degener ative br anch point motif of non-host introns have an elevated conservation (Figure 3 D, E, Supplementary Figure S8D). Host introns of circular sisRNAs have a larger position-wise mean conservation than nonhost introns at almost all positions at the SS and BP loci, with the largest differences being observed at the BP region (71.3% of positions adjacent to the BP have log FC scores > 1 for stable lariats, Figure 3 E). Interestingly, the mean phyloP-score distribution across the BP loci of introniccircle host-introns resemble that of non-host introns rather than that of the stable lariat host-introns (only 35.6% of positions adjacent to the BP have log FC scores > 1 for intronic cir cles, Figur e 3 E). The high conservation le v els at the branch point region of introns hosting stable lariats may indica te tha t this region is of particular importance for their stability or function.

Circular sisRNAs have tissue-specific expression patterns
To investigate tissue-specific expression patterns of circular sisRNAs, we utilized the CIRCexplorer2 pipeline on 113 total RNA-Seq adult and fetal samples from 46 tissues from the ENCODE project ( 41 ). By this approach, a large set of circular sisRNAs ( n = 10 162) supported by at least two junction-spanning reads were profiled across all tissues (Supplementary Table S5, Supplementary Figure S2B). The circular sisRNA set detected in NMIBC tumors share limited overlap with the combined tissue set ( n = 215) and with the urinary bladder tissue in particular ( n = 34). The total e xpression le v el (ranging from 0.43 to 27.18 RPM) and the number of circular sisRNAs (ranging from 5 to 1690) are found to be highly tissue specific (Figure 5 A, B, Supplementary Table S6). The total expression is usually driven by few highly expressed circular sisRNAs, with the top 10% of circular sisRNAs contributing between 27% and 75% of the total tissue expression (Supplementary Figure S11A). In addition, the majority (54%) of circular sisRNAs are expressed in only one tissue type (Figure 5 C).
This generalizes the findings made by Zhang et al. ( 19 ), where some ciRNAs were found to be broadly expressed (ci-ankrd52), while others had a highly tissue and cell-type specific expression pattern in humans. Robic et al. also showed that the number of intronic circRNAs differed between tissues in bovine and porcine data ( 74 ). Here, we also detected 238 circular sisRNAs in at least ten different tissues, including sisHNRNPK expressed in 44 tissues as well as sisMBNL1 and ci-ankrd52 present in 38 different tissue types (Figure 5 A, B).
In general, all tissue types have a low fraction of intronic cir cles compar ed to stable lariats, with an intronic circle fraction of at most 20% (mean = 7.2%; Figure 5 A, Supplementary Table S6). The urinary bladder tissue has an intronic circle percentage of 5.7% (34 of 599 circular sisR-N As), w hich is roughl y ten times lower than the intronic cir cle per centage for the NMIBC samples (Figure 5 A, Figur e 1 E). Inter estingly, tissue samples from the central nervous system (CNS) tend to have the highest intronic circle fractions, with all CNS tissues having a fraction larger than  Table S6). The accumulation of intronic circles in the CNS is concordant with previous findings of circRNAs, including intronic circles in mice, tending to accumulate in slow-dividing cells and aging brain tissue across multiple species ( 16 , 39 , 75-77 ). CNS studies could thus be particularly well-suited for yielding insights into the biological functions of intronic circles.

Circular sisRNA are found in multiple cell fractions
Intronic lariats are produced during transcription and are typicall y thereafter ra pidl y degr aded by the debr anching enzyme (Dbr1) and exonucleases, a process taking place in the cell nucleus. Exportation of circular sisRNAs to the cyto-plasm can promote stability due to the physical separation of the circular RNA containing the 2'-5' junction and Dbr1 ( 11 , 12 ). We investigated circular sisRNA presence in the cytoplasm of fractionated cell lines originating from bladder cancer (T24 and HCV29) and T24 deri v ed lung metastasis (FL3), using previously generated local total RNA-Seq data ( 26 ). A small set of circular sisRNAs (n = 149) was detected across the nuclear and cytoplasmic fractions of the three cell lines (Supplementary Table S7A Table S7A).
Similar results were observed for the ENCODE cell lines K562 ( n = 11; 5630 circular sisRNAs identified) and HepG2 ( n = 4; 880 circular sisRNAs identified) ( 41 ) (Supplementary Figure S11B, C). For both K562 and HepG2, circular sisRNAs found in the cytosol have the lowest median lengths of 399 bp (K562) and 538 bp (HepG2) as well as the lowest corr esponding inter quartile range (IQR) (Supplementary Table S7B, C). This indicates that smaller intronic molecules tend to be exported to the cytoplasm more frequently or that they possess the highest stability in the cytosol. Finally, similar proportions of intronic circles were found across all cell fractions (within each cell line), with the BC cell lines having the highest percentage (Supplementary  Table S7).

sisHNRNPK and sisWDR13 expressions are correlated with clinical outcomes
To assess the potential clinical importance of circular sis-RNAs, we evaluated their expression in three major risk classes (RCs) of NMIBC with distinct molecular characteristics and clinical outcomes (class 1: low-risk tumors with mainly good prognosis, n = 96; class 2: high-risk tumors characterized by poor prognosis, n = 232; and class 3 tumors: low / intermediate risk tumors, n = 129) ( 35 ). Class 1 tumors showed the highest total circular sisRNA expression (median = 1.32 RPM), while class 2 (median = 0.65 RPM) and class 3 tumors (median = 0.47 RPM) had significantly lower e xpression le v els ( q < 6e-10; log FC > 0.75; Wilco x on Rank-Sum Test; BH FDR controlled, Figure 6 A). By conducting pairwise differ ential expr ession analyses between the three risk classes for the highly abundant circular sis-RNAs ( n = 97), we found that 9 cases were differentially expressed between class 1 and 2 tumors, and 16 between class 1 and 3 tumors ( q < 0.1; Wilco x on Rank-Sum Test; BH FDR controlled, Figure 6 B, Supplementary Figure S12A). No circular sisRNAs were differentially expressed between risk class 2 and 3.
The two most abundant circular sisRNAs, sisHNRNPK and sisWDR13, showed the most perturbed expression profiles between risk class 1 and 2, and 1 and 3 tumors (Figure 6 B, Supplementary Figure S12A). Both sisRNAs were significantly higher expressed in class 1 tumors than in class 2 and 3 tumors ( q < 5e-04; log FC > 1.2; Wilco x on Rank-Sum Test; BH FDR controlled, Figure 6 C). In addition, all pairwise comparisons between risk classes of both host intron and parent gene expression were significant ( q < 0.1 for all comparisons; Wilco x on Rank-Sum Test; BH FDR controlled, Supplementary Figure S12B, C), except for the sisHNRNPK-intron between class 1 and 2 ( q = 0.2, Wilco x on Rank-Sum Test; BH FDR controlled, Supplementary Figure S12B).
Risk class 2 has recently been shown to be further subdivided into subclasses 2a (highest progression rate) and 2b (higher immune cell infiltration and lower progression rate) ( 78 ). Curiously, the total circular sisRNA expression was lowest in risk class 2a, including the individual expression of sisHNRNPK and sisWDR13 ( Supplementary Figur e S13). Her e, RC 1 showed the highest cir cular sisRNA ex-pression, RC 2b and 3 showed an intermediate expression, while RC 2a had the lowest overall circular sisRNA expression (Supplementary Figure S13). The initial higher circular sisRNA expression in RC 2 compar ed to RC 3 (Figur e 6 A), may thus be dri v en by the subgroup 2b with an intermediate prognosis.
Based on the observations of circular sisRNA deregulation in subclasses of NMIBC with distinct biological characteristics and prognosis, we evaluated the prognostic potential of circular sisRN As, particularl y, sisHNRNPK and sisWDR13. Interestingly, we found that progr ession-fr ee survival correlated positively with the total circular sisRNA expression ( P = 1.4e-04) as well as the expression of both sisHNRNPK ( P = 0.01) and sisWDR13 ( P = 0.05) (Log-Rank Test, Figure 6 D, E). While the sisWDR13 parent gene and its host intron yielded similarly significant results ( P < 0.01, Log-Rank Test, Supplementary Figure S14), the expression of the HNRNPK gene or the sisHNRNPK-intron were not able to distinguish patients with good prognosis from patients with poor prognosis ( P > 0.34, Log-Rank Test, Supplementary Figure S14). Looking at other prominent circular sisRNAs, we found no significant correlation with progr ession-fr ee survival. This could either be because the sisRNA expression could not distinguish between patient groups with high and low progr ession-fr ee survival rates (sisCNOT6, P -value = 0.97; sisFLOT1, P -values = 0.94) or potentially because we did not have the statistical power to detect differences caused by low sisRNA resolution and unbalanced group-sizes (sisARHGEF10L, Pv alue = 0.052; sisMBNL1, P -v alues = 0.22). sisSLK had a negati v e correlation with progr ession-fr ee survival ( P = 0.028), howe v er, similar results were also found for the parent gene expression.

DISCUSSION
In this study, we comprehensi v ely characterized the overall landscape of circular sisRNAs in early stage bladder cancer. We identified two distinct classes of circular sisR-NAs, i.e. stable lariats and intronic circles, which localize to both nuclear and cytoplasmic compartments and show tissue specific expression patterns. We found that circular sis-RNA host introns are short introns, which contain canonical splice site motifs, and are higher expressed than other intr ons fr om the same gene. The stab le lariats are deri v ed from conserved introns with an 'AU'-rich motif enriched upstream of the BP position, and an overall high uracil frequency at the BP region. In contrast, intronic circles are deri v ed from lowly conserved introns with BP loci similar to conventional introns. In addition, we showed that circular sisRNAs are differentially expressed between prognostic bladder cancer risk classes and that the total circular sis-RNA expression as well as individual cases correlate with progr ession-fr ee survival.
We have shown that circular sisRNAs can be studied and novel cases identified through analysis of large sets of total RNA-Seq samples, despite low e xpression le v els. Future total RNA-Seq data sets with higher read-depth and more samples, will improve the detection power and resolution, and enable further delineation of regulatory and functional hypotheses ( 79 , 80 ). During this study, we chose a relati v ely lenient read-support le v el for circular sisRNA junctions, which should be present in at least two samples. We found a sensitivity approach was needed to comprehensively detect rare circular junction reads, particularly, for the 2'-5' traversing reads. Selecting circular sisRNAs with higher support may remove true positives with low expression performing local gene regulation, as suggested by Zhang et al. ( 19 ).
Large-scale human BP mapping efforts recently led to the discovery of 3'-5' intronic circles, with 3% of all inferred BPs mapping to the 3' SS position ( 14 ). Interestingly, we found that ∼45% of circular sisRNA in bladder cancer are intronic circles spanning the entire introns, which is a vastly larger fraction than in normal tissue. This could be caused by perturbations of intronic circle synthesis or transcript degradation in cancer, with similar deregulation being observed for NAR Cancer, 2023, Vol. 5, No. 3 13 other ncRNA species ( 27 , 81 ). Some circular sisRNA clusters contained junction-reads supporting both stable lariats and intronic circles (e.g., sisPAPOLA and sisCLK1; Supplementary Table S2), which may further support the hypothesis that intronic circles are deri v ed from a lariat intermediates through a post-transcriptional e v ent as suggested by Taggart et al. ( 13 ).
We characterized features suggested to reduce the lariatdebranching efficiency of Dbr1 and thus affect circular sis-RNA stability ( 11 , 12 , 18 , 19 , 21 , 24 ). The exact mechanisms providing stability may differ between the sisRNA subclasses. For intronic circles, the expected 3'-5' junctions are inaccessible by Dbr1 and thus inhibits regulation and degradation through the conventional lariat turnover pathway ( 82 ).
In contrast, stable lariats r equir e other methods to escape debr anching. Inter action between the lariat substr ate and the acti v e site of Dbr1 depends on the existence of a 3' tail ( 82 , 83 ). In yeast, a small BP to 3' SS distance has been shown to be sufficient in stabilizing some sisRNAs ( 24 ). We hypothesized that a similar mechanism may be present in humans due to the association between stable lariats and at least partial loss of the 3' tail ( 5 , 11 , 19 ). A shorter initial 3' tail length may imply a quicker degradation by exonucleases and an increased stability of the intronic lariat. Howe v er, no such trend could be observed in our data and the digestion of the 3' tail may simply be a product of stable lariats persisting sufficiently long for nucleases to act on them.
The stable lariats had a slightly lower fraction of the canonical branch point nucleotide adenine (A) and a reduced canonical BP consensus motif ( 14 , 34 , 69 ). Cytosine (C) has previously been found to be a frequent BP nucleotide of stable lariats in multiple species including human, mouse, pigs and zebrafish cells and is known to reduce the debranching efficiency of Dbr1 ( 12 , 18-21 ). We also observed an increased usage of C at the BP of stable lariats, howe v er, not to the degr ee pr eviously r eported. This may in part be due to the different data preparation procedures, including absence of RNase R treatment leading to a generally lower read-depth at the circularizing junction and a lower BP resolution.
The Taggart BPs annotated to stable lariats generally differed slightly from those deri v ed with CIRCexplorer2. This may partly be caused by processing differ ences, wher e Taggart et al. used a U2snRN A:pre-mRN A base-pairing model to infer the true BP position in place of RT skipping ( 14 ). Howe v er, differing BP coor dinates may also be a product of alternati v e BP usage across tissues ( 84 ), or between healthy and cancer tissue ( 85 , 86 ). Thus, a focus on the BP loci rather than the exact BP positions, is expected to produce the most robust results. RNase R treatment of samples in future studies may further help resolve the BP loci of known circular sisRNAs ( 11 , 12 , 40 , 42 , 70 , 79 ), which could gi v e nov el insights into how they obtain stability and potential disruption in cancer.
We found that circular sisRNAs and their host introns are relati v el y short, w hich might affect debranching by Dbr1 or increase their chance of being exported to the cytoplasm ( 11 , 12 , 19 ). Correspondingly, we detected circular sisRNAs in the cytoplasm of human cell lines, which were significantly shorter than in other cell fr actions. Similar ly, cyto-plasmic circular sisRNAs hav e pre viously been detected in zebrafish, frog, chicken, mouse, and human cells, and were deri v ed from short introns and suggested to have a regulatory role in mRNA translation ( 11 , 12 ). Our results thus further support that a subset of circular sisRNAs may perform their function in the cytoplasm.
Sequence conservation, which implies purifying selection to preserve function, was significantly higher for introns hosting stable lariats than same-gene non-host introns, with highly conserved cases including sisHNRNPK and sisMBNL1. This may be attributable to sequence elements important for the stability and function of circular RNAs. The position-wise mean conservation immediately upstr eam and downstr eam of the BP wer e se v eral or ders of magnitude higher than for the non-host intron classes. This suggests that conserved sequence motifs at the BP locus may play a vital role in intronic lariat stability or function. Inter estingly, similar r esults wer e not f ound f or the intronic circles, indica ting tha t the eleva ted sequence conserva tion is unique for the stable lariats and their BP regions.
Circular sisRNAs may have potential as prognostic biomarkers in bladder cancer, similar to previous findings for exonic circular RN As (circRN As) ( 26 , 27 ). Our results show that the overall expression of circular sisRNAs differ between prognostic risk classes. Deregulation of circular sisRNAs may in principle function as a marker for cancer progression, as patients with low circular sisRNA expression have a poorer prognosis and the expression correlates with progr ession-fr ee survi val. In particular, the e xpression of sisHNRNPK positi v el y correlates with pro gr ession-fr ee survival, while the expression of the parent gene, HNRNPK , does not. sisHNRNPK is highly conserved and has different characteristics compared to other stable lariats and thus appears to obtain stability through other means. HNRNPK is a well-described DN A / RN A binding protein involved in pre-mRN A processing, including RN A splicing ( 87 ). Furthermore, HNRNPK is involved in the tumorigenesis of multiple cancer types (88)(89)(90)(91)(92), with specific lncRNA association of hnRNP-K previously shown to either promote or inhibit self-renewal in bladder-cancer stem-like cells ( 88 , 93 , 94 ). HNRNPK expression has also been shown to be upregulated in bladder cancer and correlate with poor clinical outcome in patients ( 92 ). We hypothesize that sisHNRNPK ma y pla y an autoregulatory function to finetune the transcription le v el of its parent gene needed for normal cell function, as negati v e feedback mechanisms have evolved repeatedl y for RN A binding proteins ( 95-100 ) and autoregulation has previously been described for ci-ankrd52 in HeLa cells and sisR-1 in Drosophila Melanogaster ( 19 , 23 ).
In order to produce a high-confidence circular sisRNA set, we filtered out circular sisRNAs originating from ncRNA genes (e.g., MALAT1) and having a mitochondrial origin. Howe v er, recent studies indicate that some of these circular junctions could have the potential to r epr esent true circular sisRNAs and could be a point of further study ( 15 , 16 ).
In summary, we have profiled the expression of circular sisRNAs across healthy tissues and in bladder cancer, and evaluated their biological characteristics and clinical associa tions. Our stud y further identified individual circular sisRNAs that stand out in terms of expression levels, evolutionary conservation, or clinical correlations, including sisHNRNPK, sisWDR13, and sisMBNL1, which may act as candidates for the experimental functional characterisation of circular sisRNAs and their role in cancer.

DA T A A V AILABILITY
The NMIBC RNA-Seq data was initially published by Hedegaard et al. and is available under controlled access. The access process can be initiated by contacting Lars Dyrskjøt (lars@clin.au.dk). As a consequence of privacy laws, data will only be available following new approvals by ethical committees and data protection agencies.
The RNase R treated sequencing data can be found at the Short Read Archi v e (SRA050270). Branch point tab les fr om hg19 intr ons are available as supplementary data for Taggart et al. ( http://fairbr other.biomed.br own.edu/data/ Lariat2016/ ). The total RNA-Seq tissue data are accessible thr ough the ENCODE pr oject ( https://www.encodepr oject. org/ ; see Supplementary File S3). Fractionated K562 and HepG2 cell line data are also available through ENCODE (see Supplementary File S1-2). Total RNA sequenced data from the bladder cancer cell lines and fractionated cell lines are deposited in NCBI's Gene Expression Omnibus (GSE100971).